A Bayesian approach for estimating the uncertainty on the contribution of nitrogen fixation and calculation of nutrient balances in grain legumes

Background The proportion of nitrogen (N) derived from the atmosphere (Ndfa) is a fundamental component of the plant N demand in legume species. To estimate the N benefit of grain legumes for the subsequent crop in the rotation, a simplified N balance is frequently used. This balance is calculated as the difference between fixed N and removed N by grains. The Ndfa needed to achieve a neutral N balance (hereafter \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta$$\end{document}θ) is usually estimated through a simple linear regression model between Ndfa and N balance. This quantity is routinely estimated without accounting for the uncertainty in the estimate, which is needed to perform formal statistical inference about \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta$$\end{document}θ. In this article, we utilized a global database to describe the development of a novel Bayesian framework to quantify the uncertainty of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta$$\end{document}θ. This study aimed to (i) develop a Bayesian framework to quantify the uncertainty of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta$$\end{document}θ, and (ii) contrast the use of this Bayesian framework with the widely used delta and bootstrapping methods under different data availability scenarios. Results The delta method, bootstrapping, and Bayesian inference provided nearly equivalent numerical values when the range of values for Ndfa was thoroughly explored during data collection (e.g., 6–91%), and the number of observations was relatively high (e.g., \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ge 100$$\end{document}≥100). When the Ndfa tested was narrow and/or sample size was small, the delta method and bootstrapping provided confidence intervals containing biologically non-meaningful values (i.e. < 0% or > 100%). However, under a narrow Ndfa range and small sample size, the developed Bayesian inference framework obtained biologically meaningful values in the uncertainty estimation. Conclusion In this study, we showed that the developed Bayesian framework was preferable under limited data conditions ─by using informative priors─ and when uncertainty estimation had to be constrained (regularized) to obtain meaningful inference. The presented Bayesian framework lays the foundation not only to conduct formal comparisons or hypothesis testing involving \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\theta$$\end{document}θ, but also to learn about its expected value, variance, and higher moments such as skewness and kurtosis under different agroecological and crop management conditions. This framework can also be transferred to estimate balances for other nutrients and/or field crops to gain knowledge on global crop nutrient balances. Supplementary Information The online version contains supplementary material available at 10.1186/s13007-024-01261-9.

framework can also be transferred to estimate balances for other nutrients and/or field crops to gain knowledge on global crop nutrient balances.

Background
The biological nitrogen (N) fixation is an essential process in legume species.This process is highly relevant in agroecosystems because it represents a sustainable strategy to possibly increase soil N stock [1], reducing the dependence on N fertilizers and thus minimizing agriculture's environmental footprint [2].The fixed N is stored in crop tissues until harvest, where a fraction of this N is exported with grains and other remains in the field as stover.The proportion of N that comes from the N fixation process, with respect to the crop N demand, is termed as N derived from the atmosphere (Ndfa).This quantity is typically computed as: There is a general consensus in grain legume studies to compare the Ndfa with the proportion of N allocated to the grains (i.e., the N harvest index, NHI) for estimating N gains or losses in the cropping system [3,4].When the Ndfa (relative N input) is greater than the NHI (relative N output), legumes are expected to contribute with N to the overall soil N balance.The N balance of legume crops is usually estimated as the difference between the quantity of fixed N by the crop and removed N by harvestable organs (both expressed in kg ha −1 ).This is a simplification to calculate the N balance in agroecosystems because other N inputs (synthetic N fertilizers, manures, atmospheric N depositions, irrigation water) and outputs (leaching, volatilization, denitrification, surface runoff ) are not considered [5,6].However, for the purposes of this study, it is defined that: (i) when the contribution of the belowground N (N in roots and rhizodeposition) is excluded, the N balance is referred to as partial N balance (PNB); and (ii) when this plant fraction is included, it is termed total N balance (TNB).
The Ndfa needed to achieve a neutral PNB or TNB can be estimated as a function of Ndfa [7,8].Usually, the relationship between Ndfa and PNB or TNB is described with the simple linear regression model (Fig. 1): where y i is the PNB or TNB (in kg ha −1 ) for the ith obser- vation, x i is the ith observation of Ndfa (predictor vari- able), β 0 (intercept) is the expected PNB or TNB when the crop did not fix N (i.e., when x i = 0 ), β 1 (slope) is the Ndfa = Fixed N kg ha −1 N uptake kg ha −1 • 100% (1) y i = β 0 + β 1 x i + ε i , change in PNB or TNB per unit of Ndfa (note that our previous knowledge on the subject allows us to assume β 1 ≥ 0 ), and ε i is the residual error.After fitting the model, the expected value of PNB or TNB is set to zero, that is E y i = 0 .By doing so, it is said that PNB or TNB are expected to be neutral.Then, according to Eq. 1, the Ndfa to get a neutral PNB or TNB can be determined by finding the x value when E(y i ) = 0 .We label this quantity θ , and it can be calculated according to as it is shown in Fig. 1.
For modeling purposes, it is important to consider that the Ndfa is a proportion that can only take values between 0 and 100.Likewise, θ is the Ndfa to achieve a neutral N balance (PNB or TNB) that also can only take values between 0 and 100.Thus, 0 ≤ θ ≤ 100 .After θ is estimated using Eq. 2, its uncertainty must be quantified to enable formal statistical inference.Perhaps the two most common approaches to quantify the uncertainty are the delta method [9] and bootstrapping [10], which enable the calculation of standard errors and confidence intervals on the estimates for θ .The delta method is an asymptotically (large sample size) based technique that implements Taylor series approximation to approximate the variance of a function of a random variable.On the other hand, bootstrapping is a (2) θ = −β 0 β 1 , Fig. 1 Illustration of the estimation of the Ndfa (proportion of the total crop N derived from the atmosphere) needed to achieve a neutral partial or total N balance ( θ ).That quantity called θ is calculated as a function of the intercept ( β 0 ) and of the slope ( β 1 ) of the common linear regression model for partial or total N balance ( y ) as a function of Ndfa ( x) computational technique based on a resampling of the observed data.Alternatively, a third approach, Bayesian inference, may also be used to estimate θ and quantify its uncertainty.
A Bayesian framework is convenient for scenarios with a limited number of observations, especially when previous information exists [11].Bayesian inference is a statistical technique based on Bayes and Laplace's work early in the 1700's, however, in the past 40 years rapid computational advancements have made Bayesian methods usable and accessible to scientists [12].Bayesian inference appears to be underutilized in legume research for estimating θ and quantifying its uncer- tainty [7,8,13,14].In this article we will present how Bayesian inference, the delta method, and bootstrapping effectively address the uncertainty quantification of θ when the data at hands contain valuable informa- tion.Furthermore, we will depict how Bayesian inference can be effective to address such problems under limited data conditions.
In most cases, θ is estimated without quantifying or considering the uncertainty surrounding the estimate [7,8,13,14].This point estimate approach, while useful, does not allow for formal statistical inference which is needed to obtain reliable scientific conclusions.Furthermore, it is important to select an appropriate statistical technique that matches the biological underlying assumptions of the range of values that the variable can take.In this article we demonstrate the delta method and bootstrapping, and we describe the development of novel Bayesian framework to quantify the uncertainty of the Ndfa needed to attain neutral N balance in legume crops ( θ ).We hypothesized that under limited data conditions, the developed Bayesian framework provided better uncertainty estimations of θ than the delta method and bootstrapping, while minor differences were expected among the three methods under greater data availability.
The aims of this study were to (i) develop a Bayesian framework to quantify the uncertainty on the Ndfa needed to achieve neutral PNB or TNB in grain legume species, and (ii) contrast the use of this framework with the delta method and bootstrapping under different scenarios of data availability.The developed Bayesian framework can expand a new study niche in agriculture, offering opportunities for parameter estimations, formal statistical inference, uncertainty quantification and propagation, among other applications.Furthermore, this article also serves as a practical guide for Bayesian nonpractitioners to apply Bayesian inference in other areas of study within the field of agriculture.We present a case study with the sole objective of illustrating a potential use of this statistical framework.

Materials and methods
We illustrate the three methods (delta method, bootstrapping, Bayesian inference) for quantifying the uncertainty on θ by retrieving the data from [13].The workflow implemented in this study is depicted in the flowchart presented in Fig. 2. Furthermore, Fig. 2 provides information to future users to decide in which cases apply the Bayesian framework developed in this article.Since there is no rule for defining when datasets are small, overall, it would be justified running the proposed Bayesian framework in cases where information is available to define priors and/or the delta method and/or bootstrapping provide unreliable uncertainty estimations (e.g.θ > 100% or θ < 0%).

Data collection and description
The variables (Ndfa, fixed N, seed N) were collected in chickpea (Cicer arietinum L.), common bean (Phaseolus vulgaris L.), cowpea (Vigna unguiculata L.), faba bean (Vicia faba L.), field pea (Pisum sativum L.), lentil (Lens culinaris Medik), white lupin (Lupinus albus L.), blue lupin (Lupinus angustifolius L.), and peanut (Arachis hypogaea L.).A similar literature search was conducted to retrieve the proportion of N that is allocated to the roots and rhizodeposition with respect to the total N in the plant (above + belowground N).For more details about the criteria to select papers for the database see [13].We also included unpublished data for peanut in the current study.

Variable descriptions and calculations
The authors in [13] studied belowground N contributions retrieving the information from articles implementing physical recovery and 15 N-labelling techniques for quantifying belowground N [15].With the collected information, Palmero et al. [13] calculated a root factor as: where Below ground N and Above ground N refer to the proportions of N found below and above the ground, respectively, relative to the total plant N demand.For instance, assume a scenario where the total N in a crop (considering above and belowground structures) is 125 kg ha −1 , but this value is unknown.What is known is the amount of N (kg ha −1 ) in the aboveground structures of the crop and the proportion of N allocated to the roots relative to the total N in the crop.Assuming that the aboveground N is quantified at 100 kg ha −1 and the root N allocation is 20%.Therefore, we know that 100 kg of N ha −1 represents 80% of the total N in the crop.Therefore, (3) Root Factor = 1 + Below ground N Above ground N , by calculating the root factor, and applying it to estimate the total N uptake (above + belowground structures), we have the following: Thus, the root factor allows the incorporation of the N allocated to the roots when calculating the total fixed N by a crop (see Eq. 5).
The Ndfa (%) values, representing the proportion of total aboveground N derived from the N fixation process, were obtained through a literature review only for aboveground parts of the crop.The fixed N (kg ha −1 ) can be calculated by excluding the contribution of N from roots and rhizodeposition as follows: Total N in the crop kg ha −1 = 100 kg ha −1 • 1.25 Fixed Aboveground N kg ha −1 = Total Aboveground Uptake N kg ha −1 • Ndfa(%) 100 .
In addition, assuming Ndfa did not differ between above-and below-ground structures [16], the Ndfa estimated for the aboveground tissues can be used to estimate the fixed N considering the belowground N contribution according to Eq. 3 and Eq. 4 as Lastly, the Fixed Aboveground N and Total Fixed N can be used to calculate the PNB and TNB, respectively, as follow: If either of these balances are positive, it means that the fixed N (excluding (Eq.6) or including (Eq.7) belowground N contribution) is greater than the N exported in (5) Total Fixed N kg ha PNB (kg ha −1 ) = Fixed Aboveground N (kg ha −1 ) TNB (kgha −1 ) = Total Fixed N (kg ha −1 ) − Seed N (kg ha −1 ).
Fig. 2 Flowchart depicting the workflow implementing in this study and showing the cases where it is worth applying the developed Bayesian framework seeds and a net soil N input occurs, resulting in a positive N balance.On the other hand, negative values indicate that the fixed N was not enough to compensate the N exported in seeds and a net soil N reduction takes place, resulting in a negative N balance.

Statistical models
In this section, we will provide details about the different approaches to estimate the parameter of interest and introduce a few modifications in the original model [7,8] to incorporate previous knowledge in the statistical model.

Regression model
First, we used a simple linear regression model for PNB or TNB as a function of Ndfa.We introduced this model in Eq. 1 as where y i , x i , β 0 , β 1 , and ε i have the same interpretation than that mentioned in the background section for Eq. 1.
This model has a deterministic part, β 0 + β 1 x i , and a ran- dom part, ε i , which is usually assumed that ε i ∼ N (0, σ 2 ) .The Ndfa needed to achieve neutral N balances in grain legume species (called θ ) can be calculated as a function of β 0 and β 1 .The delta method and bootstrapping can be implemented to quantify the uncertainty of θ.
The delta method allows us to approximate sampling distributions for functions of random variables.Since both β 0 and β 1 have their own variance estimation and θ is a function of the β′s , the delta method can be applied to estimate the variance of θ [17].Then, under the assumption that the sampling distribution of the ratio −β 0 β 1 is asymptotically normally distributed, the approximated variance of θ (obtained via delta method) can be used to construct confidence intervals.
The bootstrapping technique utilizes the plug-in principle to estimate the population distribution based on the empirical distribution of the observed data [10].Applying this computational technique to a linear model consists of taking K random samples (with replacement) of the same size as the original data (n), and then fitting the linear model to each of the K samples of size n.Finally, the K estimates are utilized to construct the confidence interval of θ .In this study, K was equal to 10,000.

Bayesian inference
Bayesian inference offers an alternative to solve the challenges found when implementing the delta method and bootstrapping for this particular study.Through Bayesian inference, the estimation of the model parameters and their variability can be regularized by including previous knowledge into the model [18].For example, in Fig. 1, values of θ greater than 100 are not possible.Therefore, it is realistic to incorporate this assumption into our linear regression model.Up to this point, the model as presented in Eq. 1 can be implemented under any framework.However, a Bayesian framework is needed in some situations because it allows us to incorporate information about the model parameters in the form of probability distributions, known as prior distributions (regulator; Table 1).
As described, the regression model in Eq. 1 does not allow to incorporate knowledge about the θ , because θ

Moments
A set of values used to quantify characteristics of a probability distribution, such as, its mean (expected value) and variance.Moments describe the shape, location, and spread of a probability distribution

Expected value
The mean of a random variable weighted according to the probability distribution.It is the first moment of the probability distribution of a random variable.It is represented as E(.)

Variance
The second central moment of the probability distribution of a random variable.It measures the degree of spread of a distribution around its mean Prior distribution An assumed function that maps the probability for a specific model parameter, that is independent of the data to be analyzed (e.g., θ ∼ N(0,2) ).The natural regulator in Bayesian models

Hyperparameter
The parameter that defines the prior distribution that is assumed fixed and known (e.g., 0 and 2 in a prior distribution for θ that is θ ∼ N(0,2))

Likelihood
A function that maps the probability or density of the model parameters given the observed data.It is the link between the data and the posterior distribution of the parameters in Bayesian models Posterior distribution Probability distribution of the parameters after observing (given) the data

Marginal posterior
Probability distribution of a single random variable in a Bayesian model that is conditional only on the data

Regularization
The process of constraining a statistical inference problem (i.e., penalization or shrinkage)

Regulator
Prior, penalty, or constraint is not directly but indirectly present in Eq. 1 by utilizing β 0 and β 1 (Eq.2; Fig. 1).In Bayesian inference any unobserved quantity is considered a random variable.Therefore, the model parameters β 0 , β 1 are random vari- ables under the Bayesian paradigm.Furthermore, a function of a random variable is also a random variable.We showed that θ = −β 0 β 1 .Therefore, θ becomes a parameter that models the proportion of N that the legume crop has to fix (Ndfa) to achieve a neutral N balance.Using the expression θ = −β 0 β 1 , we can write β 0 as a function of θ and β 1 represented as β 0 = −β 1 θ.Now, we can use the last equality to plug it in Eq. 1.Thus, the original linear model can be re-written as where y i is the PNB or TNB (kg ha −1 ) for the ith observa- tion, x i is the ith observation of Ndfa, β 1 represents the change in PNB or TNB per unit of Ndfa, θ is the Ndfa required to achieve a neutral N balance, and ε i is the residual error.Since the framework we propose in this study is Bayesian, we need to provide priors (also called parameter models) for the model parameters in Eq. 8. We have assumed that ε i ∼ N 0, σ 2 .Therefore, σ 2 ─or the standard deviation ( √ σ 2 = σ )─ is a parameter to be estimated as is β 1 and θ .The parameter models (priors) selected on this study were: (8) (10) θ ∼ beta(α, β), (11) σ ∼ gamma(2.5, 0.05).
The numerical values within the parentheses in Eq. 9-11 are the parameters that shape the probability distribution.In Bayesian inference, these parameters (e.g., 1.6 and 0.8 in Eq. 9) are referred to as hyperparameters (Table 1).In this study, we uniquely determined the values for the hyperparameters for the beta distribution ( α and β ) for each species, and their values are presented in Table 2. Setting the hyperparameters allows us to define the expected value and variance for the prior probability distributions.The computed expected values were E(β 1 ) = 1.6  0.8 , E(σ ) = 2.5 0.05 , and E(θ ) = α α+β .More details related to parameters and probability distributions selected as prior are provided in the subsequent sections.

Informative priors
The Bayesian framework consists of three core steps: (i) determining the likelihood function; (ii) capturing the knowledge about the parameters in the statistical model through the prior distribution; and (iii) combining the likelihood and the prior applying the Bayes' theorem to obtain the posterior distribution of the model parameters (Table 1; Supplementary Note 1).The priors influence the posterior distribution of the model parameters, as the posterior is a balance between the data, the likelihood, and the priors [19,20].However, the impact of the priors on the posterior is usually reduced as sample size increases.Bayesian statistics allow us to incorporate previous scientific knowledge into our model through the use of priors.The information (data or expert knowledge) used for specifying the hyperparameters must be independent from the data used to fit the parameters of the model [18].In this study, we employed informative priors for the model parameters based on independent information collected from previous studies.
Table 2 Moments and hyperparameters for the prior probability distribution of θ parameter for Partial N Balance (PNB) and Total N Balance (TNB) The moments were calculated using the collected information about NHI and the hyperparameters were obtained by moment matching based on the previously calculated moments Because a greater Ndfa means a larger proportion of the total crop N being fixed, higher Ndfa results in larger N balances, making β 1 to take positive values [7,8,14].The standard deviation parameter ( σ ) is the posi- tive square root of the variance, and thus, can take only positive real numbers.The gamma distribution is a continuous distribution of the positive real numbers that can take different shapes.Therefore, we chose the gamma distribution to represent our prior knowledge of β 1 and σ .The parameter θ represents a proportion, indicating the Ndfa required to achieve a neutral N balance.Therefore, θ can take only positive values between 0 and 1 (or 0 and 100 if scaled).Since the beta distribution models continuous random variables that can take values between 0 and 1, we selected it as prior for θ .Further- more, the beta distribution has flexible shapes, which is not the case of a standard uniform distribution.Selecting beta as prior distribution for θ restricts its values, which are in fact delimited by the nature of the Ndfa concept to be 0 ≤ θ ≤ 1 .This exemplifies how priors can behave as regulators in Bayesian models [18].

Species
The hyperparameters for the beta distribution used in Eq. 10 for θ were defined based on previous literature to best represent our prior knowledge of that parameter.However, no information was directly available for the θ .Thus, bringing Eq. 4 and Eq. 6 and considering that θ represents the Ndfa value when PNB (or TNB) is equal to zero: This result indicates that the PNB equals zero when the Ndfa is equal to the N harvest index or NHI [3,4].This calculation uses PNB rather than TNB because most current scientific literature calculates NHI without considering below-ground biomass contribution.
We conducted a literature review to collect information about the NHI of the nine legumes species included in this study.This review was independent from that utilized by [13].The literature search was conducted through Google Scholar ® , Scopus ® , and Web of Science ® search engines (last search on August 8, 2023) using the following keywords: ("legume scientific name" OR "legume vulgar names") AND ("nitrogen harvest index" OR "NHI" OR "N harvest index").We retrieved a total of 153 articles (excluding duplicates).The selection criteria were: (i) the experiments were performed in field conditions; (ii) NHI has been reported, and calculated excluding roots; and (iii) management information (e.g., N rate, sowing date, irrigation, insect, and pest management) and potential stress factors (e.g., drought, heat, nutrient deficiency) were reported.If the crop performance was severely affected by management practices or stress, the study was not included to avoid NHI values < 0 or > 1.In addition, the study must not have been included in the previous review process used to build the original database to ensure data independence to build the priors for θ PNB and θ TNB .Ultimately, out of the pooled of arti- cles, a total of 42 studies were included in the analysis (https:// figsh are.com/s/ 60a9c f527e cb9de 02166).We used the NHI values collected for each species to calculate its mean and variance, which represent the mean and the variance of the Ndfa required for a neutral PNB, herein termed as θ PNB .Then, the mean and the variance were used to calculate the parameters ( α and β) of the beta distribution via moment matching (Supplementary Note 2), which was used as prior distribution for each of the studied species (Table 2).For the Ndfa to achieve a neutral TNB, we reduced the expected value of the Ndfa in a proportion equivalent to the proportion of N that is allocated to root according to previous information [13,21].Given the low certainty of this prior information, we increased the variance of the prior distribution by 50% to account for this uncertainty.Finally, we used the recalculated mean and variance to calculate the hyperparameters of the beta distribution for θ , now referred as θ TNB (Table 2).Supplementary Fig. 1 illus- trates the discrepancies between prior probability distributions of θ PNB and θ TNB .

Model fitting and parameters of the posterior probability distributions
The model presented in Eq. 8 to Eq. 11 was fitted to the nine species in the original database [13] for both response variables (PNB and TNB).The expected value and variance of the model parameters ( β 1 , θ , and σ ) were computed from their marginal posterior probability distributions (Table 1; Supplementary Note 1) for the PNB and TNB.Subsequently, we used the estimated expected values and variances to determine the parameters of the probability distribution of β 1 , θ , and σ via moment matching (Supplementary Note 2).Those parameters are reported in Table 3, facilitating their use as priors in future applications of the method presented in this study.

Case study
In this section, we bring and describe a case study to showcase a potential use of the method presented in this article.This illustration is applied after estimating θ , and quantifying its uncertainty, to formally determine whether the PNB underestimates the true N balance in legume species.The first step is computing the PNB and TNB using independent observations.Then, these observed PNB and TNB are used to fit the Bayesian model presented in Eqs.8-11.Once the model is fitted, the posterior probability distribution for θ PNB and θ TNB are selected.The Ndfa comparison is made simply by subtracting the posterior probability distribution for θ PNB and θ TNB parameters.Then the quantiles of the desired credible interval are computed.Finally, it is observed whether zero is included in the credible interval of the distribution of the difference: (i) if zero is included, we conclude θ PNB and θ TNB are not different, (ii) if zero is not included, θ PNB and θ TNB are different.
For simplicity, we implemented this case study only for two species, chickpea, and common bean, using data from a literature review [13].Before fitting the model, we split the data into half by randomly sampling the collected studies (without replacement), which resulted in five chickpea and two common bean studies per subset.Then PNB and TNB were computed in each subset independently.The data was split to make comparisons between θ PNB and θ TNB .Next, we fitted the Bayesian model (Eq.8 to Eq. 11) to these subsets within each species.The posterior probability distribution for θ PNB and θ TNB were obtained and subtracted.The 95% credible interval was determined by computing the 0.025 and 0.975 quantiles of the probability distribution of the difference.Finally, it was observed whether zero was included in the credible interval.

Table 3 Moments and hyperparameters for the model parameter for Partial N Balance (PNB) and Total N Balance (TNB)
The moments were obtained from the posterior probability distribution and the hyperparameters were calculated using moment matching based on the previously obtained moments.For the gamma distributions, the first value between the parenthesis in the hyperparameter columns is the shape parameter ( α ) while the second is the rate parameter ( β ).For the beta distributions those values are called shape ( α and β ) The hyperparameters might be used as priors in future applications of the model presented in this study

Computation and reproducibility
The Bayesian model presented in Eq. 8 to Eq. 11 was fitted using a Markov Chain Monte Carlo (MCMC) algorithm called Non-U-Turn Sampling (NUTS).A total of 4 chains were implemented with 20,000 iterations in total and 10,000 iterations as warm-up.The convergence of the chains was assessed visually through trace plots and analytically via the Gelman-Rubic diagnostic [22].A seed was set for reproducibility.We performed the Bayesian analyses using Stan probabilistic programming language via rstan package [23].The bootstrapping technique was implemented using the rsample package [24], while the delta method was carried out with the msm package [25].All the statistical analyses were performed using the R software [26] in RStudio interface [27].The code for the analyses is publicly available in https:// github.com/ Franc iscoP almero/ Ndfa_ uncer tainty and https:// figsh are.com/s/ 60a9c f527e cb9de 02166.The databases used in this article are available at https:// figsh are.com/s/ 60a9c f527e cb9de 02166.

Delta method, bootstrapping, and Bayesian inference performance
We evaluated the delta method, bootstrapping, and Bayesian inference in two contrasting scenarios.These methods provided nearly equivalent numerical values when the range of possible values for Ndfa was thoroughly explored (e.g., 6-91%), and the number of observations was relatively high (e.g., n ≥ 100 ) (Scenario A in Fig. 3).In the opposite case (Scenario B in Fig. 3) was when the Ndfa observations were closer to their upper limit, the range of the Ndfa was poorly explored (e.g., 46-78%), and the number of observations was low.Under these conditions, the delta method and bootstrapping provided uncertainty estimates, such as confidence intervals, that contained nonviable values in the real world, i.e. less than 0% or more than 100% (Scenario B in Fig. 3).Therefore, these results show that, in a data deficiency scenario, bootstrapping and delta method could yield values outside the expected biological range, 0-100%.

Posterior probability distributions
The expected value and variance of the model parameters for the nine species, computed from their marginal posterior probability distributions, are shown in Table 2.The θ parameter values fell within the biologically plausible range of [0, 100] (Fig. 4).This depicts how priors act as regulators in Bayesian models.Initially, the hyperparameters for β 1 and σ in the gamma distribution were the same across species.However, those values are now different, which is also the case for the parameters of the beta distributions for θ (see Table 2 and Table 3).This depicts how Bayesian framework can be used to combine prior knowledge (expressed as probability distributions) with the observed data to update our understanding about a given process in Fig. 3 Point estimation and uncertainty quantification of θ implementing delta method, bootstrapping and Bayesian inference under two contrasting scenarios.In scenario A, the range of possible values for Ndfa is thoroughly explored (e.g., 6-91%), and the number of observations is relatively high ( n ≥ 100 ), and the three methods work similarly.In scenario B, observations of Ndfa are concentrated closer to its upper limit, the range of the Ndfa is poorly explored (e.g., 46-78%), and the number of observations is low (approximately 10), delta method and bootstrapping provide confidence intervals of θ that contain nonviable values in the real world.Field pea (Pisum sativum L.) (Scenario A) and white lupin (Lupinus albus L.) (Scenario B) data from [13] were implemented nature using concepts from probability theory.The influence of the observed data on the posterior probability distribution (updating process) depended on the number of observations for a given species.The greater the number of observations, the lower the influence of the prior on the posterior.This is depicted in Fig. 4 based on the proximity of the prior and posterior distribution peaks (the closer the peaks, the higher is the influence of the prior on the posterior).
We obtained the expected value and the variance of the model parameters from their marginal posterior probability distributions and utilized them to calculate the hyperparameters of their respective assumed distributions (priors).Therefore, the probability distributions  3 can be used as prior distributions on future applications of the model presented in this study.

Case study
A potential use of the proposed method was applied to chickpea and common bean to determine differences between θ PNB and θ TNB .This case study is graphi- cally represented in Fig. 5.The probability distribution for the difference between θ PNB and θ TNB are depicted through histograms.The lower and upper limits of the confidence interval were indicated with red-dashed lines.For chickpea, the absence of zero in the credible interval suggests a significantly lower Ndfa requirement for neutral N balance when belowground plant N is considered (Fig. 5).Specifically, chickpeas would require fixing between 22 and 58% less N, with a median of 41%, to achieve a neutral N balance if roots were considered.In contrast, for Common Bean, the N contribution from belowground plant structures was not substantial enough to show a significant difference in Ndfa requirement for neutrality (zero was included in the 95% credible interval; Fig. 5).
When interpreting this analysis, two main aspects should be noted: (i) The direction of the difference between θ PNB and θ TNB .In this case, the difference was computed as θ PNB − θ TNB .This means that positive val- ues indicate a lower Ndfa needed to achieve a neutral N balance when belowground plant N is considered.If the difference is computed as θ TNB − θ PNB , the conclusions about the significance of the difference are not affected (if zero is included or not), but the interpretation of the values is different.(ii) This method employs Bayesian inference, using credible intervals instead of confidence intervals common in classical analyses.The confidence intervals are understood as the proportion of experiments that would contain the true difference under a long series of replications of the experiment.While the credible interval is understood as the probability of the difference lying between the limits of the interval.

Discussion
In this study, we developed a Bayesian framework to quantify the uncertainty on the Ndfa required to achieve neutral N balances in grain legume species.This approach was contrasted with the use of delta method and bootstrapping.The developed a framework allowed us to solve common issues in fitting N balance models, like obtaining unrealistic estimates and results, especially when using small datasets.This is the first study introducing new perspectives on using Bayesian inference for reliable estimations and uncertainty quantification of the Ndfa needed to achieve neutral PNB ( θ PNB ) or TNB ( θ TNB ) in grain legume species.We demonstrated the incorporation of previous knowledge into the model, as well as providing information for improving the inference in future studies.The hyperparameters of the posterior probability distribution of model parameters were presented, which can be implemented as prior in future investigations [28].This article can also serve as a practical guide for agricultural scientists new to the Bayesian framework.Moreover, a case study demonstrated a potential application of the method, showcasing its usefulness in global estimations of N fixation contributions in grain legume species.
By applying the Bayesian framework, the probability distribution of θ was obtained for each of the addressed legume species.Previous studies analyzing the importance of grain legumes to N balance in agroecosystems reported point estimations for θ [7,8,13,14].The lack of uncertainty quantification does not enable researchers to make formal statistical inference.Therefore, the Bayesian framework proposed in this study opens new avenues for studies determining the role of grain legumes in nutrient balances in agroecosystems.Obtaining the probability distribution of θ provides us with complete information about the minimum Ndfa needed to achieve neutral PNB or TNB in grain legume species.Therefore, the Bayesian framework developed in this article lays the foundation not only to conduct formal comparisons or hypothesis testing involving θ , but also to learn about its expected value, variance, and higher moments such as skewness and kurtosis under different agroecological, soil, and crop management conditions.
Beyond Bayesian inference, other statistical techniques were explored to quantify the uncertainty of θ .One alternative was implementing the delta method to approximate the variance and confidence interval of θ [17].Additionally, bootstrapping [10] was also utilized to obtain the empirical confidence intervals of θ .Under lim- ited data conditions, these two methods provided confidence intervals for θ that contain nonviable values in the real world.Usually, in non-Bayesian statistical inference, procedures are evaluated under asymptotic behavior, i.e. under large sample sizes.Since the delta method is justified under asymptotic conditions [29], it is expected that this method will not be consistent and efficient under small sample sizes, leading to an unreliable estimation of the uncertainty of θ under low data availability.Further- more, although the delta method tends to underestimate the standard errors in comparison to bootstrapping [30], bootstrapped confidence intervals can still be erratic for small sample sizes [10].
Regularization techniques (Table 1) can be implemented to solve the issue of obtaining extremely wide or unrealistic confidence intervals on the estimation of a model parameter [18].Although regularization techniques (e.g.ridge and Lasso regressions, and random effects in linear mixed models) might introduce bias on the estimation of the model parameters (e.g.β 1 and θ ), these techniques reduce the variance of those estimations, thereby narrowing the confidence intervals of the estimated parameters.However, the classical regularization approach does not provide guidance to select the penalization term [18,31].Iterative cross-validation can be implemented to select appropriate regulator parameters [18].Nevertheless, this approach depends on out-ofsample data, which is an undesirable characteristic under limited data conditions.As well as, using cross-validation for determining the penalization term still yields point estimates for the model parameters, making it difficult or even impossible to quantify their uncertainty [18].
In the developed framework, we applied Bayesian inference for estimating and quantifying uncertainty while constraining (regularizing) the estimation of the model parameters via informative priors (regulator).Therefore, with respect to the classical regularization perspective (penalized likelihood), Bayesian models have the advantage of (i) providing formal guidance to define the hyperparameters for the priors (regulator), and (ii) utilizing formal probability theory for constraining the model parameters [31].Priors can be selected so that the influence of the prior on the posterior is minimized (which are called weakly informative priors) or based on the prior knowledge about the parameters (informative priors) [28].In this study, the use of informative priors, mainly under limited data conditions, enabled to: (i) combine independent datasets (collected from previous studies) into a simple modeling framework to obtain meaningful inference, and (ii) formally constrain the model parameters while estimating their uncertainty as it was previously depicted in ecology-related studies [20,32].
In this study, we showed that the developed Bayesian framework excelled under limited data conditions as it was shown in other field of studies such as epidemiology and medicine [33,34].However, the power of Bayesian inference under low number of observations was paid by adding stronger assumptions into the model, which were the use of informative priors.The informative priors must be correctly justified by collecting information from sources such as literature, previous experiments, expert knowledge, natural or biological conditions, among others to make the inference reliable [11].Furthermore, the developed Bayesian framework was worthwhile when the collected data were close to biological limits and the model parameter estimations had to be regularized to obtain meaningful inference [18].Under large sample sizes, the influence of the priors on the posterior is reduced [35], the delta method improves its consistency and efficiency because of their asymptotic behavior [36,37], and the bootstrapped confidence intervals are more consistent [10].Therefore, with a high number of observations, the developed Bayesian framework, bootstrapping, and delta method provided similar uncertainty quantification of θ.
Bayesian inference is an available statistical tool that seems to be underutilized in agriculture studies addressing the N contribution by legumes and the estimation of θ .The framework presented in this article can also be applied to gain knowledge about the maximum N output that a grain legume species can achieve to contribute with N to the system.Furthermore, this Bayesian framework can also be implemented to obtain the distribution of the minimum N uptake needed before a legume crop is able to start fixing N [13,21,38].However, the use of Bayesian inference in agriculture transcends these potential applications in the field of N fixation in grain legume species.Bayesian inference is a pertinent tool in agricultural sciences, where uncertainty is present everywhere, and previous and expert knowledge hold significant importance.Therefore, we hope this article also serves as an initial guide for Bayesian non-practitioners in the field of agriculture to apply Bayesian inference (regarding that other approaches can also be valid).
The databases and the code utilized in this study have been made publicly available.Science utilizes data (evidence) to answer accurate questions (hypotheses) by inductive reasoning developing new knowledge or theories that are then used as a benchmark to formulate further questions that are proved or disproved in future studies [39].Therefore, science is a continuous built from a community sharing its knowledge.By making the utilized databases and codes publicly available, our objective was to contribute to the development of a more robust, transparent, and reliable approach to advancing scientific knowledge [40].The available datasets can be implemented to test new hypotheses related to N balance in legume species, such as whether the contribution by roots is substantial in a given species (differences between θ PNB and θ TNB ), identify potential changes in the PNB or TNB per unit of Ndfa among species or within species (differences in β 1 ), or analyzing the uncer- tainty around relative N outputs (i.e.NHI) between and within species.Building such datasets demands a significant investment of time and effort.Consequently, we seek to engage additional collaborators in the ongoing process of database updating.This entails the incorporation of new studies, additional N inputs and outputs (e.g., N 2 O and/or NH 2 emissions, NO 2 leaching), and the addition of metadata, including crop rotation, weather conditions, topography, and tillage practices, among other variables.
Simplified N balances of the legume crops were implemented in this study to estimate the uncertainty of θ .Although this a common a practice to evaluate the N benefit of grain legume species to cropping systems [7,8,13,14,38,[41][42][43], this approach does not consider other N inputs and losses [5,6], leading to oversimplified N balance estimations.Therefore, upcoming research should study θ considering other N inputs and outputs beyond the fixed N and the N exported in seeds, respectively, to better understand the role of grain legume species in the N balance of the agroecosystems.Furthermore, there exist different techniques to measure the N contribution of belowground components [15].The variety of techniques along with the difficulty in recovering nodules, and thin and fragile roots generate uncertainty on the N contribution of belowground components and consequently on the root factor [13].Hence, subsequent studies should account for this variability on belowground N contribution when computing the uncertainty of θ TNB .Additionally, a higher quantity and quality of data to estimate the proportion of Ndfa derived from roots are needed to improve the accuracy of the prior information to be included in the presented Bayesian framework.

Conclusion
This study explored the use of the delta method, bootstrapping, and Bayesian inference to quantify the uncertainty of the Ndfa that grain legume species need to attain neutral PNB or TNB ( θ ).For Bayesian models, regularization is a natural consequence of using informative priors.This article depicted the usefulness of Bayesian approach to obtain meaningful inference and formally constrain the model parameters via the combination of independent data sets into the same modeling framework.Since there exists knowledge about the Ndfa needed to achieve neutral PNB or TNB in grain legume species, we expect that the use of informative priors takes more relevance when estimating this quantity and its uncertainty.Future studies should provide information of the slope ( β 1 ) and the Ndfa needed to achieve neutral PNB or TNB ( θ ) to develop informative priors, being crucial to fully embrace the potential of regularization.The developed Bayesian inference framework can be transferred to estimate balances for other nutrients and/or field crops to gain more knowledge on global crop nutrient balances.

Fig. 4 Fig. 5
Fig. 4 Histograms of samples from the prior (blue) and posterior (posterior) probability distributions for the θ parameter for Partial N Balance ( θ PNB ) and Total N Balance ( θ TNB ).The numbers inside the plot of each species indicate the number of observations.Priors are beta probability distributions based on the previously collected data from the literature, and posterior probability distributions were obtained as the marginal distributions of θ PNB and θ TNB via MCMC sampling